Advanced Corpus Solutions for Humanities Researchers

Authors

  • James Wilson
  • Anthony Hartley
  • Serge Sharoff
  • Paul Stephenson

Abstract

This paper describes the design and implementation of an interface to corpora in 12 languages, based on an analysis of the needs of a diverse group of users: language teachers and students, (non-computational) linguists, and researchers in history and translation studies. We identified a set of requirements shared across the disciplines, as well as more specific requirements of the targeted user groups. The interface is designed to handle large-scale corpora of 20-500 million words.

Similar resources

Supporting Serendipitous and Focused Search

People with complex information needs are for example Humanities researchers, who need advanced search engines to investigate their research questions. Much can be gained by combining research datasets, reusing tools and serendipitously discovering new insights for further research. Humanities researchers have different (large-scale) research datasets and tools, which are described differently ...

GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus

This paper introduces a software tool, GutenTag, which is aimed at giving literary researchers direct access to NLP techniques for the analysis of texts in the Project Gutenberg corpus. We discuss several facets of the tool, including the handling of formatting and structure, the use and expansion of metadata which is used to identify relevant subcorpora of interest, and a general tagging frame...

Vocabulary Lists for EAP and Conversation Students

Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...

A multi-level multimedia concordancer for spoken language corpora (Un concordancier multi-niveaux et multimédia pour des corpus oraux) [in French]

Concordances have always played an important role in the analysis of language corpora, for studies in humanities, literature, linguistics, translation and language teaching. However, very few of the available systems support multi-level queries against a richly-annotated, sound-aligned spoken corpus. The rapid growth in the development of spoken corpora, particularly for French, increases the n...

Enhancing Access to Media Collections and Archives Using Computational Linguistic Tools

In this paper, we outline the strategies, methodology, and infrastructure needed to bring advanced computational linguistic tools to researchers and archivists in the humanities. We discuss three use cases involving the application of the Language Application Grid (LAPPS), an open, web-based infrastructure providing interoperable access to hundreds of computational linguistic (CL) component web...

Publication date: 2010